The objective is
To implement Cocke-Kasami-Younger (CKY) algorithm for bottom-up CFG parsing, and apply it to the word and the parsing problem of English.
Ingredients that we have to represent or manipulate using NLTK, are:
To run examples from the statement of the assignment you need atis grammar, if atis grammar is not available, execute nltk.download()
and select large_grammars in tab models and press "install".
The following line of the statement is wrong!
In [ ]:
# parse all test sentences
for sentence in s:
parser.chart_parse(sentence[0]
That's because s
is a long string with all the sentences joined. You have to use t
that is a list of tuples, where each element of the list corresponds to a sentence, a tuple whose first element is a list of words in the sentence.
In [ ]:
# parse all test sentences
for sentence in t:
try:
parser.chart_parse(sentence[0])
except ValueError:
pass
I explain this because is quite frustrating to try the most basic commands and that they don't work.
The try-except
statement it is necessary because some sentences had tokens that are not available in the grammar, this rise an exception that stops program execution. This statement prevents program execution stops.
This is a mistake in the assignment wording.
To make easier the execution, move atis-grammar-cnf.cfg
and atis-test-sentences.txt
to the same folder as the script. They are modified versions of files that can be downloaded using ntlk.download()
and that are located on nltk_data
folder of your home.
Files of atis large-grammar
are useful because atis_sentences.txt
files has the number of parse trees corresponding to each sentence, this information is handy to check our algorithms.
Well, let's start! Begin loading the grammar:
In [1]:
import nltk
g = nltk.data.load("atis-grammar-cnf.cfg")
In [2]:
print type(g)
g
Out[2]:
As the parameter of nltk.data.load()
has the .cfg
extension, NLTK read it as a grammar.
Then load the sentences:
In [3]:
s = nltk.data.load("atis_test_sentences_a3.txt", "raw")
s[:100]
Out[3]:
In [4]:
t = nltk.parse.util.extract_test_sentences(s)
t[:3]
Out[4]:
As you can see, the function nltk.parse.util.extract_test_sentences()
splits s
by \n
as separator of sentences, and then split each sentence using whitespace as separator of words.
None
values indicate that there are not information on whether the sentence is grammatical or not, or the number of parse trees that should have.
First, we create a production_table
that is a dict of lists such that the keys are productions right-hand-sides of the given grammar, and the values the corresponding production.
This a little trick that would be necessary for CKY
algoritms implementations.
In [5]:
from assignment3 import init_production_table
init_production_table??
In [6]:
pt = init_production_table(g)
print pt[('VERB_VB', 'BUM')]
In [ ]:
function CKY-PARSE (words, grammar) returns table
for j ← from 1 to LENGTH(words) do
table[j − 1, j] ← {A | A → words[j] ∈ grammar }
for i ← from j − 2 downto 0 do
for k ← i + 1 to j − 1 do
table[i,j] ← table[i, j] ∪
{A | A → BC ∈ grammar, B ∈ table[i, k], C ∈ table[k, j]}
Then the implementation is almost straightforward in Python using NLTK objects.
In [7]:
from assignment3 import cky_recognizer
cky_recognizer??
In cky_recognizer()
we start to define the table:
In [ ]:
n_tokens = len(tokens)
table = [[None for j in range(n_tokens + 1)] for i in range(n_tokens)]
"words" in the pseudocode corresponds to tokens
, and "grammar" corresponds to cnf_grammar
.
The loop
for j ← from 1 to LENGTH(words) do
is implemented as:
In [ ]:
for j in range(1, n_tokens + 1):
LENGTH(words)
corresponds to n_tokens
.
Remember that range(i, j)
goes from i
to j-1
, therefore is added one to n_tokens.
The expression:
table[j − 1, j] ← {A | A → words[j] ∈ grammar }
Means that left-hand-sides of the productions with word[j]
in right-hand-side that belong to the grammar are assigned at the position [j - 1, j]
of the table (the diagonal).
In [ ]:
productions = cnf_grammar.productions(rhs=tokens[j-1]) # productions with tokens[j-1] (word[j]) in rhs
lhs_list = [] # list of left-hand-side
for production in productions:
lhs_list.append(production.lhs())
table[j - 1][j] = lhs_list
The loop
for i ← from j − 2 downto 0 do
Corresponds to:
In [ ]:
for i in range(j - 2, -1, -1):
And
for k ← i + 1 to j − 1 do
table[i,j] ← table[i, j] ∪
{A | A → BC ∈ grammar,
B ∈ table[i, k],
C ∈ table[k, j]}
Means that left-hand-sides of the productions with B and C in right-hand-side that belong to the grammar, are assigned at the position [i, j]
of the table (upper right corner). Where, B and C are elements to the "right" and "below" of position [i, j]
respectively.
"the contens of two cells can be combined in a way that is sanctioned by a rule in the grammar. If such a rule exists, the non-terminal on its left-hand side is entered into the table." (Jurafsky & Martin, 2009)
In [ ]:
lhs_list = []
# Range over the places the string can be split
for k in range(i + 1, j):
right = table[i][k] # Right along row i
down = table[k][j] # Down along column j
for r_lhs in right:
for d_lhs in down:
key = (str(r_lhs), str(d_lhs))
if production_table.has_key(key):
for production in production_table[key]:
lhs_list.append(production.lhs())
Here you can see the production_table
trick. It's an easy way to obtain the productions that corresponds to a given pair of nonterminals.
Finally, the function returns the table.
In [ ]:
return table
To determine whether the sentence is in the CKY language we can use is_in_CFG_language()
function.
This function basically verifies that at least one nonterminal in the upper right corner of the table corresponds to the grammar start (in this case 'SIGMA'
).
In [8]:
from assignment3 import is_in_CFG_language
is_in_CFG_language??
Let's try 40 sentences from atis-test-sentences.txt
, that are selected in the file "atis_test_sentences_a3.txt"
. Sentences with more than 50 trees were avoided. There are 3 cases:
In [9]:
for i, sentence in enumerate(t):
table = cky_recognizer(sentence[0], g, pt)
if table != []:
if is_in_CFG_language(table, g):
print 'sentence {} is in language'.format(i + 1)
else:
print 'sentence {} not recognized'.format(i + 1)
else:
print 'sentence {} has tokens not covered by grammar'.format(i + 1)
Examples:
The main diference between the recognizer and the parser is that where before we had left sides of productions now have trees (of type nltk.Tree
).
Where we had in recognizer:
In [ ]:
productions = cnf_grammar.productions(rhs=tokens[j-1])
lhs_list = []
for production in productions:
lhs_list.append(production.lhs())
table[j - 1][j] = lhs_list
In parser we have:
In [ ]:
productions = cnf_grammar.productions(rhs=tokens[j-1])
trees = []
for production in productions:
trees.append(nltk.Tree(production.lhs(), [production.rhs()]))
table[j-1][j] = trees
Note that the tree is an structure such that the parent node corresponds to the production left-hand-side, and the children nodes corresponds to the production right-hand-side.
In a similar way, where in recognizer we had:
In [ ]:
lhs_list = []
# Range over the places the string can be split
for k in range(i + 1, j):
right = table[i][k] # Right along row i
down = table[k][j] # Down along column j
for r_lhs in right:
for d_lhs in down:
key = (str(r_lhs), str(d_lhs))
if production_table.has_key(key):
for production in production_table[key]:
lhs_list.append(production.lhs())
table[i][j] = lhs_list
Now, in parser we have:
In [ ]:
trees = []
# Range over the places the string can be split
for k in range(i + 1, j):
right = table[i][k] # Right along row i
down = table[k][j] # Down along column j
for r_tree in right:
for d_tree in down:
key = (str(r_tree.label()), str(d_tree.label()))
if production_table.has_key(key):
for production in production_table[key]:
trees.append(nltk.Tree(production.lhs(),
[r_tree, d_tree]))
table[i][j] = trees
In this last snippet of code, note that the magic is that r_tree
and d_tree
are inserted as children nodes, so that at the top level we have complete trees for the whole sentence.
In [ ]:
trees_found = []
for tree in table[0][n_tokens]:
if str(tree.label()) == str(cnf_grammar.start()):
trees_found.append(tree)
return trees_found
In this case, we return a list of trees, those which top node coincides with the grammar start ('SIGMA'
).
In [10]:
from assignment3 import cky_parser
cky_parser??
In [ ]:
f = open("atis_n_trees.txt", "w")
for sentence in t:
trees = cky_parser(sentence[0], g, pt)
f.write("{}\t{}\n".format(len(trees),' '.join(sentence[0])))
f.close()
In [ ]:
trees = cky_parser(t[5][0], g, pt)
for tree in trees:
tree.draw()
In [ ]: